Skip single file in a partition from OPTIMIZE #23864

raunaqmorarka · 2024-10-22T09:02:22Z

Description

A single file with no deletes in a partition can't be optimized any further.
Skipping the rewrite of such files improves the performance of OPTIMIZE on
large partitioned tables.

Additional context and related issues

Fixes #10785

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Improve performance of OPTIMIZE on large partitioned tables. ({issue}`10785`)

findinpath

Only cosmetics spotted.
I appreciate seeing the testing coverage for the newly added functionality.

findinpath · 2024-10-22T18:20:05Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

        List<Object> splitsInfo = ImmutableList.copyOf(scannedFiles.build());
+        log.info("Generated %d splits, skipped %d files for OPTIMIZE", splitsInfo.size(), filesSkipped);


Why not debug?

I think it's useful enough to be at info until we enhance OPTIMIZE to print a useful summary at the end of execution. We already get a bunch of useless logs from the Iceberg library in IcebergSplitSource anyway, so this isn't adding much to the noise.

findinpath · 2024-10-22T18:29:13Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

    private ImmutableSet.Builder<DataFileWithDeleteFiles> scannedFiles = ImmutableSet.builder();
+    @Nullable
+    private Map<StructLikeWrapperWithFieldIdToIndex, Optional<FileScanTaskWithDomain>> scannedFilesByPartition = new HashMap<>();


I would argue that there is no need to log how many files were skipped.
Consider using a local variable in the method instead of adding more state to the class.

This state is required regardless of the log line. We need to maintain scannedFilesByPartition for the entire lifetime of the split source because iteration over file tasks is not guaranteed to proceed one partition at a time.
The log line is useful for knowing what happened. We should probably extend OPTIMIZE to print useful stats like this at the end of execution.
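
For readers following the thread, the per-partition bookkeeping described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `FileTask` and `SingleFileSkipper` are hypothetical names, and a partition is reduced to a plain string. The first delete-free file seen in a partition is held back; it is released only if a second file for the same partition turns up, and whatever is still held at the end can be skipped. Because files of one partition may arrive interleaved with files of others, the map has to live as long as the whole scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a file scan task; not a Trino or Iceberg type.
record FileTask(String partition, String path, boolean hasDeletes) {}

// Illustrative only: tracks, per partition, a lone delete-free file that may be skipped.
class SingleFileSkipper
{
    // Optional.of(task): exactly one delete-free file seen so far, currently held back.
    // Optional.empty(): the partition is already known to need rewriting.
    private final Map<String, Optional<FileTask>> heldByPartition = new HashMap<>();

    // Returns the tasks that can be emitted now that `task` has been seen.
    List<FileTask> accept(FileTask task)
    {
        List<FileTask> emit = new ArrayList<>();
        Optional<FileTask> held = heldByPartition.get(task.partition());
        if (held == null && !task.hasDeletes()) {
            // First file of the partition and nothing to clean up: hold it back for now
            heldByPartition.put(task.partition(), Optional.of(task));
            return emit;
        }
        // Either this file carries deletes or the partition has more than one file,
        // so release any file held back earlier together with the current one
        if (held != null) {
            held.ifPresent(emit::add);
        }
        heldByPartition.put(task.partition(), Optional.empty());
        emit.add(task);
        return emit;
    }

    // Files still held back when the scan ends are the ones that get skipped.
    long skippedFiles()
    {
        return heldByPartition.values().stream().filter(Optional::isPresent).count();
    }
}
```

In this sketch the skipped count is only meaningful once every task has been passed to `accept`, which mirrors why the log line is emitted after all splits are generated.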

losipiuk · 2024-10-24T08:43:01Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

@@ -304,6 +277,54 @@ private CompletableFuture<ConnectorSplitBatch> getNextBatchInternal(int maxSize)
        return completedFuture(new ConnectorSplitBatch(splits, isFinished()));
    }

+    private Iterator<FileScanTaskWithDomain> prepareFileTasksIterator(List<FileScanTaskWithDomain> fileScanTasks)


You are materializing FileScanTaskWithDomain, so just return List<FileScanTaskWithDomain> here.

It's more convenient to return an iterator, as the rest of the existing code is written to work on an iterator.
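
As a rough illustration of the trade-off discussed here, the sketch below shows why an iterator can be convenient for batch-oriented consumers even when the underlying tasks are already materialized in a list: the iterator keeps its position between calls, so each call simply drains up to `maxSize` elements. `BatchReader` and its methods are hypothetical names chosen for this example and are not taken from IcebergSplitSource.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical batch reader; names are illustrative and not taken from Trino.
class BatchReader<T>
{
    private final Iterator<T> tasks;

    BatchReader(List<T> materialized)
    {
        // The list is fully materialized, but the consumer only ever sees the iterator
        this.tasks = materialized.iterator();
    }

    // Drains up to maxSize elements; the iterator remembers the position between calls.
    List<T> nextBatch(int maxSize)
    {
        List<T> batch = new ArrayList<>();
        while (batch.size() < maxSize && tasks.hasNext()) {
            batch.add(tasks.next());
        }
        return batch;
    }

    boolean isFinished()
    {
        return !tasks.hasNext();
    }
}
```

Returning a list instead would work equally well here, at the cost of the caller tracking an index or calling `iterator()` itself.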

losipiuk · 2024-10-24T08:44:23Z

...trino-iceberg/src/main/java/io/trino/plugin/iceberg/StructLikeWrapperWithFieldIdToIndex.java

+import java.util.Objects;
+import java.util.stream.IntStream;
+
+public class StructLikeWrapperWithFieldIdToIndex


nice class name

credit to @homar ;)

losipiuk

A bit hard to read - but logic seems fine

cla-bot bot added the cla-signed label Oct 22, 2024

raunaqmorarka requested a review from alexjo2144 October 22, 2024 09:02

github-actions bot added the iceberg Iceberg connector label Oct 22, 2024

raunaqmorarka requested review from homar, findinpath, findepi, losipiuk and ebyhr October 22, 2024 09:02

raunaqmorarka added the performance label Oct 22, 2024

findinpath reviewed Oct 22, 2024

anusudarsan self-requested a review October 22, 2024 18:44

raunaqmorarka force-pushed the iceberg-opt-skip branch from 04ebc29 to f8684a0 Compare October 23, 2024 14:02

raunaqmorarka requested a review from findinpath October 23, 2024 14:08

raunaqmorarka force-pushed the iceberg-opt-skip branch from f8684a0 to 6549eb1 Compare October 23, 2024 14:37

losipiuk reviewed Oct 24, 2024

losipiuk approved these changes Oct 24, 2024

raunaqmorarka added 5 commits October 25, 2024 19:48

Avoid fileStatisticsDomain as member variable in IcebergSplitSource

9bcd085

Keeping the domain attached to the relevant FileScanTask is safer than handling it as a member variable of the class

Refactor IcebergSplitSource to allow processing of multiple FileScanTask

6032b35

This will be used in a subsequent commit to skip unnecessary files from OPTIMIZE

Extract StructLikeWrapperWithFieldIdToIndex to separate class

ec5db34

Add StructLikeWrapperWithFieldIdToIndex#createStructLikeWrapper

b2c7e6e

Skip single file in a partition from OPTIMIZE

84ac714

Single file without a delete in a partition can't be optimized any further

raunaqmorarka force-pushed the iceberg-opt-skip branch from 6549eb1 to 84ac714 Compare October 25, 2024 14:32

raunaqmorarka merged commit 75fa7cb into trinodb:master Oct 25, 2024
45 checks passed

raunaqmorarka deleted the iceberg-opt-skip branch October 25, 2024 15:22

github-actions bot added this to the 464 milestone Oct 25, 2024

mosabua mentioned this pull request Oct 25, 2024

Add Trino 464 release notes #23881

Merged

raunaqmorarka mentioned this pull request Nov 10, 2024

Delete files are not removed after running Iceberg maintenance ops #24086

Open
